library(tidyverse)
library(gridExtra)
library(emmeans)
library(ggpubr)
library(kableExtra)
library(grid)
options(width=100)

Rename this file and the folder it is in with your student number (e.g., “1912345.Rmd”) Submit this rmd file and the knitted html as a zip file (e.g., 1912345.zip) by right clicking on the folder to zip it. That is, the zip file 1912345.zip should contain the folder 1912345 with only the files 1912345.Rmd and 1912345.html —

Submitted by: <2284896>

Date Sent: 15th Jan, 2023

Module Title: Business Statistics

Module Code: IB94X0

Date/Year of Module: Term One 2022-2023

Submission Deadline: 12:00 (UK time), Monday 16th Jan, 2023

Word Count: —-

Number of Pages: 10

Question: [ [80%] Assignment 2 ]

“I declare that this work is entirely my own in accordance with the University’s Regulation 11 and the WBS guidelines on plagiarism and collusion. All external references and sources are clearly acknowledged and identified within the contents.

No substantial part(s) of the work submitted here has also been submitted by me in other assessments for accredited courses of study, and I acknowledge that if this has been done it may result in me being reported for self-plagiarism and an appropriate reduction in marks may be made when marking this piece of work.”

Question 1

Section 1

Variable Description
Country Country (England, Northern Ireland, Wales) [Column 1]
LAType Local authority type [Column 2]
LAName Local authority name [Column 3]
Totalestablishments(includingnotyetrated&outside) Count of total establishments (including not yet rated and outside) [Column 4]
Establishmentsnotyetratedforintervention Count of not yet rated establishments for intervention [Column 5]
Total%ofBroadlyCompliantestablishments-A Total percent of broadly compliant establishments (rated A) [Column 10]
Total%ofInterventionsachieved(premisesratedA-E) Total percent of successful interventions of establishments (rated A-E) [Column 19]
Total%ofInterventionsachieved-premisesratedA Total percent of successful interventions of establishments(rated A) [Column 20]
Total%ofInterventionsachieved-premisesratedB Total percent of successful interventions of establishments(rated B) [Column 21]
Total%ofInterventionsachieved-premisesratedC Total percent of successful interventions of establishments(rated C) [Column 22]
Total%ofInterventionsachieved-premisesratedD Total percent of successful interventions of establishments(rated D) [Column 23]
Total%ofInterventionsachieved-premisesratedE Total percent of successful interventions of establishments(rated E) [Column 24]
Total%ofInterventionsachieved-premisesnotyetrated Total percent of successful interventions of not yet rated establishments [Column 25]
ProfessionalFullTimeEquivalentPosts-occupied * Number of occupied professional full-time-equivalent (FTE) posts [Column 36]

__*Note: There will be new columns in the following analysis…__

data<-read_csv("Desktop/BS_FInal/2019-20-enforcement-data-food-hygiene.csv")
#str(data) #check the structure of the data.
#summary(data) #check the data overview.

#We find that Total%ofInterventionsachieved-premisesratedA contains the non-numeric value "NR", so we coerce the column to numeric (turning "NR" into NA).
data$`Total%ofInterventionsachieved-premisesratedA`<-as.numeric(data$`Total%ofInterventionsachieved-premisesratedA`)

#check NAs in each column of the data
for(i in colnames(data)){print(table(is.na(data[,i])))}
## (output condensed: columns 1-3 have no NAs; column 20, Total%ofInterventionsachieved-premisesratedA,
## has 30 NAs after the "NR" coercion; every other column has 6 NAs)
#To see what happened in the rows with NAs
data[is.na(data[[4]]), ]
## # A tibble: 6 × 36
##   Country LAType      LAName Total…¹ Estab…² Estab…³ Total…⁴ Total…⁵ Arate…⁶ Total…⁷ Brate…⁸ Total…⁹
##   <chr>   <chr>       <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <chr>     <dbl>   <dbl>
## 1 England District C… Broxb…      NA      NA      NA      NA      NA      NA <NA>         NA      NA
## 2 England District C… Chorl…      NA      NA      NA      NA      NA      NA <NA>         NA      NA
## 3 England District C… East …      NA      NA      NA      NA      NA      NA <NA>         NA      NA
## 4 England District C… Hyndb…      NA      NA      NA      NA      NA      NA <NA>         NA      NA
## 5 England District C… Tamwo…      NA      NA      NA      NA      NA      NA <NA>         NA      NA
## 6 England Metropolit… West …      NA      NA      NA      NA      NA      NA <NA>         NA      NA
## # … with 24 more variables: Cratedestablishments <dbl>,
## #   `Total%ofBroadlyCompliantestablishments-C` <dbl>, Dratedestablishments <dbl>,
## #   `Total%ofBroadlyCompliantestablishments-D` <dbl>, Eratedestablishments <dbl>,
## #   `Total%ofBroadlyCompliantestablishments-E` <dbl>,
## #   `Total%ofInterventionsachieved(premisesratedA-E)` <dbl>,
## #   `Total%ofInterventionsachieved-premisesratedA` <dbl>,
## #   `Total%ofInterventionsachieved-premisesratedB` <dbl>, …
#It seems all the NAs occur in the same six rows, so we use na.omit to delete them.

#Clean NAs
newdata<-na.omit(data)
newdata<-newdata%>%filter(newdata$`Total%ofInterventionsachieved-premisesratedA`!="NR")

Create a plot that clearly shows the distribution across the Local Authorities (LAs) of the % of enforcement actions successfully achieved. The plot below shows the distribution across the Local Authorities of the % of enforcement actions successfully achieved for all impact levels combined.

I use a jitter plot with alpha to show the distribution because, in this case, the transparency lets us see the density of LAs at each achieved percentage: the darker the colour, the higher the density.

overall<-ggplot(newdata)+
  geom_jitter(aes(LAName,`Total%ofInterventionsachieved(premisesratedA-E)`),color="red",alpha=0.2)+
  facet_wrap(~LAType)+
  labs(title = "The % of enforcement achieved",
       x="Each point stands for LA belongs to different LA types",
       y="Percentage achieved(%)")+
  theme(axis.text.x=element_blank(),
        axis.ticks.x=element_blank())
overall

#Helper to draw the same jitter plot for each intervention rating
rank_jitter<-function(col, title){
  ggplot(newdata)+
    geom_jitter(aes(LAName, .data[[col]]), color="red", alpha=0.2)+
    facet_wrap(~LAType)+
    labs(title=title, x="", y="")+
    theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())
}
rankA<-rank_jitter("Total%ofInterventionsachieved-premisesratedA","A rank % achievement")
rankB<-rank_jitter("Total%ofInterventionsachieved-premisesratedB","B rank % achievement")
rankC<-rank_jitter("Total%ofInterventionsachieved-premisesratedC","C rank % achievement")
rankD<-rank_jitter("Total%ofInterventionsachieved-premisesratedD","D rank % achievement")
rankE<-rank_jitter("Total%ofInterventionsachieved-premisesratedE","E rank % achievement")

Allrank<-ggarrange(rankA, rankB, rankC, rankD, rankE, labels = NULL,
                   common.legend = TRUE, legend = "bottom", align = "hv",
                   font.label = list(size = 10, color = "black", face = "bold", family = NULL, position = "top"))

annotate_figure(Allrank, left = textGrob("Percentage achieved(%)", rot = 90, vjust = 1, gp = gpar(cex = 1.3)), bottom = textGrob("Each point stands for LA belongs to different LA types", gp = gpar(cex = 1.3)))

Here, I categorize the percentage achieved in every rank into 10% bands (100~90, 90~80, 80~70, etc.) and use a histogram to show the distribution, since a histogram shows the overall distribution more clearly than a jitter plot. I also rename the columns for quicker coding.

#Rename columns 19-24 for quicker coding
colnames(newdata)[19:24]<-c("overall","A_rank","B_rank","C_rank","D_rank","E_rank")

#Bin a percentage into 10% bands: [0,10), [10,20), ..., [90,100]
bin10<-function(x) cut(x, breaks=seq(0,100,10), right=FALSE, include.lowest=TRUE,
                       labels=c("0~10%","10%~20%","20%~30%","30%~40%","40%~50%",
                                "50%~60%","60%~70%","70%~80%","80%~90%","90%~100%"))

newdata<-newdata%>%mutate(all_rank_segment=bin10(overall),
                          A_rank_segment=bin10(A_rank),
                          B_rank_segment=bin10(B_rank),
                          C_rank_segment=bin10(C_rank),
                          D_rank_segment=bin10(D_rank),
                          E_rank_segment=bin10(E_rank))
all<-ggplot(newdata,aes(overall))+geom_histogram(fill="cyan",color="black")+xlim(0,101)+
  labs(title="The % of enforcement achieved across all rank",x="% achieved",y="Total numbers of LA")
all

#Helper to draw the same histogram for each rank
rank_hist<-function(col, title){
  ggplot(newdata, aes(.data[[col]]))+geom_histogram(fill="red",color="black")+xlim(0,101)+
    labs(title=title, x="% achieved", y="Total numbers of LA")
}
A<-rank_hist("A_rank","Rank A % achievement")
B<-rank_hist("B_rank","Rank B % achievement")
C<-rank_hist("C_rank","Rank C % achievement")
D<-rank_hist("D_rank","Rank D % achievement")
E<-rank_hist("E_rank","Rank E % achievement")
ggarrange(A,B,C,D,E)

Examine whether there is a relationship between proportion of successful responses and the number of FTE food safety employees in each local authority. Examine whether there is a relationship between proportion of successful responses and the number of employees as a proportion of the number of establishments in the local authority – essentially creating your own measure of how many food safety employees there are per establishment.

#First, I sum the interventions for each LA and name the new column "total_intervention_achieved".
#For each relationship test, I plot the data first so we can sanity-check the linear model against the plot.
newdata<-newdata%>%mutate(total_intervention_achieved=rowSums(newdata[,26:35])) #sum all the interventions into one column

newdata<-newdata%>%mutate(interventions_achieved_proportion=total_intervention_achieved*100/(`Totalestablishments(includingnotyetrated&outside)`-Establishmentsnotyetratedforintervention)) #Calculate the percentage achieved

ggplot(newdata,aes(`ProfessionalFullTimeEquivalentPosts-occupied *`,interventions_achieved_proportion))+
  geom_point()+geom_smooth(method="lm")+
  labs(title="The relationship of successful responses(%) and the number of FTE employees",
       x="Number of FTE employees in the LA",
       y="The proportion of successful responses(%) for each LA") #Plot it

emp<-lm(interventions_achieved_proportion~`ProfessionalFullTimeEquivalentPosts-occupied *`,data=newdata)
summary(emp) 
## 
## Call:
## lm(formula = interventions_achieved_proportion ~ `ProfessionalFullTimeEquivalentPosts-occupied *`, 
##     data = newdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.486  -6.962  -0.167   7.088  35.017 
## 
## Coefficients:
##                                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                                       24.6386     1.1804  20.873  < 2e-16 ***
## `ProfessionalFullTimeEquivalentPosts-occupied *`   0.9670     0.2402   4.026 7.08e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.91 on 321 degrees of freedom
## Multiple R-squared:  0.04807,    Adjusted R-squared:  0.0451 
## F-statistic: 16.21 on 1 and 321 DF,  p-value: 7.084e-05
#We observe that each additional employee sent to supervise establishments is associated with an increase in the achievement rate of roughly 1 percentage point (0.967).
#The number of employees has a significant effect on the achievement rate (t = 4.026, p < 0.001).

Examine whether there is a relationship between proportion of successful responses and the number of employees as a proportion of the number of establishments in the local authority – essentially creating your own measure of how many food safety employees there are per establishment.

newdata<-newdata%>%mutate(employee_rate=`ProfessionalFullTimeEquivalentPosts-occupied *`/(`Totalestablishments(includingnotyetrated&outside)`-Establishmentsnotyetratedforintervention))
Q3<-lm(interventions_achieved_proportion~employee_rate,data = newdata)
Q3
## 
## Call:
## lm(formula = interventions_achieved_proportion ~ employee_rate, 
##     data = newdata)
## 
## Coefficients:
##   (Intercept)  employee_rate  
##         24.53        1458.15
summary(Q3) #We can see that the coefficient of employee rate is significant(t-value=2.51, p-value=0.0126<0.05)
## 
## Call:
## lm(formula = interventions_achieved_proportion ~ employee_rate, 
##     data = newdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.204  -6.889  -0.176   7.196  36.011 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     24.533      1.776   13.81   <2e-16 ***
## employee_rate 1458.147    580.967    2.51   0.0126 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.08 on 321 degrees of freedom
## Multiple R-squared:  0.01925,    Adjusted R-squared:  0.01619 
## F-statistic: 6.299 on 1 and 321 DF,  p-value: 0.01257
#This means the employee-per-establishment proportion also has a significant positive effect on the proportion of interventions successfully achieved.

ggplot(newdata,aes(employee_rate,interventions_achieved_proportion))+geom_point()+geom_smooth(method="lm")+labs(title="Employee per establishment V.S. Interventions achieved proportion",x="Employee per establishment", y="Intervention achieved rate(%)")


Section 2

The following report analyses food safety enforcement supervised by the local authorities of England, Wales, and Northern Ireland. The managers of the Food Standards Agency and politicians would like to see how well food safety is being maintained and, if necessary, how it can be improved. Before we turn to the requested questions, let us browse the data and understand the current situation. There are 353 observations, i.e. 353 local authorities in total, but 30 of them have no usable data (NAs and non-numeric values such as "NR"), so the analysis covers 323 local authorities with 36 variables. In Table 1. we can see that England has four types of local authority: "District Council", "London Borough", "Metropolitan Borough Council", and "Unitary Authority", whereas the other two countries each have only one type. District Councils in England account for more than 50% of all local authorities.
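The counts quoted above can be sanity-checked directly (a sketch, assuming the `data` and `newdata` objects from Section 1 are still in the environment):

```r
#Sanity check of the counts quoted above (assumes data/newdata from Section 1)
nrow(data)                  # 353 local authorities in the raw file
nrow(newdata)               # 323 after removing NA rows and "NR" values
nrow(data) - nrow(newdata)  # the 30 dropped authorities
```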

Table 1. The distribution of Local Authorities in each Country
Country LAType Number
England District Council 169
England London Borough 33
England Metropolitan Borough Council 35
England Unitary Authority 54
Northern Ireland NI Unitary Authority 10
Wales Welsh Unitary Authority 22
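Table 1 could be produced from the cleaned data with a sketch like the following (`kbl`/`kable_styling` come from the already-loaded kableExtra; the `Number` column name is my own choice):

```r
#Count local authorities per Country and LAType (reproduces Table 1)
newdata %>%
  count(Country, LAType, name = "Number") %>%
  kbl(caption = "Table 1. The distribution of Local Authorities in each Country") %>%
  kable_styling(full_width = FALSE)
```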

The managers of the Food Standards Agency would like to know the percentage of establishments that successfully respond to enforcement actions as a whole. Here we have the distribution across all the local authorities of the percentage of enforcement actions successfully achieved. The histogram below shows the percentage of enforcement achieved across all ranks from A to E. The distribution is left-skewed, which means most local authorities have done a good job of getting establishments to meet the food safety standards. However, some authorities did not do well, achieving only 50% or less.

However, I suspect there might be differences between establishments rated A, B, C, D, and E, so I create a plot showing the distribution within each rank. It presents fewer LAs in rank A, while rank E seems to have the most. The distributions of the ranks all look similar as histograms, so we cannot gain further insight; I therefore make a jitter plot to examine the distribution.

The jitter plot shows the distribution more clearly. Each point in the graph represents one LA, and the transparency of the points reflects the density of LAs at the same level (y-axis). For example, if the colour is pink, there are not many LAs at that level, but if the colour is red, there are many LAs at the same level.

I split by LA type to see which performs worst, so we know where to improve first. The graph shows that for rank A most authorities achieved nearly 100%, but among District Councils some LAs did not do well. The reason might be that there are so many authorities of this type that it is hard to keep every LA performing at a high level. The same pattern appears in all other ranks except E. The distribution for rank E shows that many authorities did not do as well on E-rated establishments as on other ranks. I suggest they might regard E-rated establishments as the least dangerous, or least urgent, and so supervise them a little more loosely.

Then, the manager also wants to know whether employing more professional enforcement officers increases the likelihood of establishments successfully responding to interventions. To study this, I examine whether there is a relationship between the proportion of successful responses and the number of employees. First, I sum all interventions for each LA and divide by the total number of establishments to obtain the proportion of establishments that achieved the interventions. Then I build a model to test whether there is a relationship between the number of employees and the percentage achieved.

The plot below shows a positive correlation between the number of employees in each LA and the percentage of successful responses. Each additional employee sent to supervise establishments is associated with an increase in the achievement rate of roughly 1 percentage point (0.967). We can conclude that the number of employees has a significant effect on the achievement rate (t = 4.026, p < 0.05).
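Using the fitted coefficients reported in the regression summary above (intercept 24.64, slope 0.967 per FTE post), the model's prediction can be sketched by hand; `predict_achievement` is a hypothetical helper, not part of the original analysis:

```r
#Hypothetical helper built from the reported coefficients (sketch only)
predict_achievement <- function(fte) 24.6386 + 0.9670 * fte
predict_achievement(0)   # baseline: ~24.6% achieved with no FTE posts
predict_achievement(10)  # ~34.3% with ten FTE posts
```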

But sometimes a proportion is better than a raw count for analysis, since some authorities have many employees but few cases, while others have few employees but more cases to deal with. So I examine whether there is a relationship between the proportion of successful responses and the number of employees as a proportion of the number of establishments in the local authority. I divide the number of employees by the total number of establishments to obtain this proportion, and build a model to test the correlation between the proportion of interventions achieved and the employee proportion. The graph clearly shows a positive correlation: the higher the employee rate, the higher the proportion of interventions successfully achieved.


Question 2

Section 1

Variable Description
sold by The company selling the book
publisher.type Publisher type
genre Genre of the book (childrens, fiction, non_fiction)
avg.review Average review score of the book
daily.sales Daily sales of the book
total.reviews Total number of reviews of the book
sale.price Sale price of the book
publisher_sales <- read_csv("Desktop/BS_FInal/publisher_sales.csv") #load the data
publisher_sales<-na.omit(publisher_sales) #remove any rows with NAs
summary(publisher_sales)
##    sold by          publisher.type        genre             avg.review     daily.sales    
##  Length:6000        Length:6000        Length:6000        Min.   :0.000   Min.   : -0.53  
##  Class :character   Class :character   Class :character   1st Qu.:4.100   1st Qu.: 56.77  
##  Mode  :character   Mode  :character   Mode  :character   Median :4.400   Median : 74.29  
##                                                           Mean   :4.267   Mean   : 79.11  
##                                                           3rd Qu.:4.620   3rd Qu.: 98.02  
##                                                           Max.   :4.980   Max.   :207.98  
##  total.reviews     sale.price    
##  Min.   :  0.0   Min.   : 0.740  
##  1st Qu.:105.0   1st Qu.: 7.140  
##  Median :128.0   Median : 8.630  
##  Mean   :132.6   Mean   : 8.641  
##  3rd Qu.:163.0   3rd Qu.:10.160  
##  Max.   :248.0   Max.   :17.460
str(publisher_sales)
## tibble [6,000 × 7] (S3: tbl_df/tbl/data.frame)
##  $ sold by       : chr [1:6000] "Random House LLC" "Amazon Digital Services,  Inc." "Amazon Digital Services,  Inc." "Amazon Digital Services,  Inc." ...
##  $ publisher.type: chr [1:6000] "big five" "indie" "small/medium" "small/medium" ...
##  $ genre         : chr [1:6000] "childrens" "non_fiction" "non_fiction" "fiction" ...
##  $ avg.review    : num [1:6000] 4.44 4.19 3.71 4.72 4.65 4.81 4.33 4.21 3.95 4.66 ...
##  $ daily.sales   : num [1:6000] 61.5 74.9 66 85.2 37.7 ...
##  $ total.reviews : num [1:6000] 92 130 118 179 111 106 205 86 161 81 ...
##  $ sale.price    : num [1:6000] 8.03 9.08 9.48 12.32 5.78 ...

Do books from different genres have different daily sales on average?

#Firstly, I want to visualize the whole data for better understanding.
company<-table(publisher_sales$`sold by`)%>%as.data.frame()
Books<-ggplot(company,aes(x=reorder(Var1,-Freq),Freq))+
  geom_bar(stat = "identity",color="black",fill="cyan")+labs(title="The variety of books sold by company",x="Company Names",y="The variety of Books")+
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
Books

#In this chart, we can see that ADS has the widest variety of books for sale, followed by Random House LLC and Penguin Group(USA) LLC.
pub<-table(publisher_sales$publisher.type)%>%as.data.frame()
Books_pub<-ggplot(pub,aes(x=reorder(Var1,-Freq),Freq))+
  geom_bar(stat = "identity",color="black",fill="purple")+labs(title="The variety of books by publisher type",x="Publisher Types",y="The variety of Books")+
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
Books_pub

#In this chart, we can see that small/medium publishers have the widest variety of books for sale. Surprisingly, amazon comes last in variety.
#To check the difference between genres, we have to categorize and summarize them by genres:
dailysales<-publisher_sales%>%group_by(genre)%>%summarise(sales=mean(daily.sales))
dailysales_pub<-publisher_sales%>%group_by(publisher.type,genre)%>%summarise(sales=mean(daily.sales))
#Then, we may want to visualize the data for a more straightforward look:
#Below is the plot grouped by genres only:
sales_plot_by_genres<-ggplot(dailysales,aes(x=reorder(genre,-sales),y=sales))+
  geom_bar(stat = "identity",color="black",fill="pink")+
  geom_text(aes(label=paste(round(sales,2),'books per day'),vjust=2),size=4)+
  labs(title="Avg sales by Genres",x="Genres",y="Sales(books per day)")
sales_plot_by_genres

#Below is the plot grouped by genres and publishers:
sales_plot_by_publisher<-ggplot(dailysales_pub,aes(x=reorder(genre,-sales),y=sales,fill=publisher.type))+
  geom_bar(stat = "identity",color="black",position = "dodge")+
  labs(title="Avg Sales by Genres and Publishers",x="Genres",y="Sales(books per day)")
sales_plot_by_publisher

#Now we can use ANOVA to test the difference between genres
anova_sales<-aov(daily.sales~genre,data=publisher_sales)
summary(anova_sales)
##               Df  Sum Sq Mean Sq F value Pr(>F)    
## genre          2 2562528 1281264    2590 <2e-16 ***
## Residuals   5997 2966133     495                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
emmeans(anova_sales,~genre)
##  genre       emmean    SE   df lower.CL upper.CL
##  childrens     55.6 0.497 5997     54.6     56.6
##  fiction      105.9 0.497 5997    104.9    106.9
##  non_fiction   75.9 0.497 5997     74.9     76.8
## 
## Confidence level used: 0.95
#The ANOVA gives an F-value of 2590 for genre with an extremely small p-value (<2e-16).
#Thus we conclude that mean daily sales differ significantly between genres.
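The ANOVA only tells us that at least one genre differs; to see which pairs of genres differ, pairwise contrasts on the emmeans grid could be added (a sketch; this output is not part of the original analysis):

```r
#Pairwise comparisons between genres, Tukey-adjusted by default
pairs(emmeans(anova_sales, ~genre))
```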

Do books have more/fewer sales depending upon their average review scores and total number of reviews?

#First, I create two lm models: one without and one with an interaction term between review score and total reviews.
publisher_sales<-publisher_sales%>%filter(total.reviews!=0 & avg.review!=0) #drop books with 0 reviews and a score of 0 (the score is 0 only because there are no reviews), so they do not distort the linear model

nointeraction<-lm(daily.sales~total.reviews+avg.review, data=publisher_sales) #The model without interaction term
interaction<-lm(daily.sales~total.reviews+avg.review+total.reviews*avg.review, data=publisher_sales) #The model with interaction term
summary(interaction)
## 
## Call:
## lm(formula = daily.sales ~ total.reviews + avg.review + total.reviews * 
##     avg.review, data = publisher_sales)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -104.327  -14.583   -0.773   13.859   93.915 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -7.36525    9.34333  -0.788    0.431    
## total.reviews             0.65854    0.06740   9.770   <2e-16 ***
## avg.review                2.61997    2.16273   1.211    0.226    
## total.reviews:avg.review -0.02181    0.01560  -1.398    0.162    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.21 on 5973 degrees of freedom
## Multiple R-squared:  0.4653, Adjusted R-squared:  0.465 
## F-statistic:  1733 on 3 and 5973 DF,  p-value: < 2.2e-16
summary(nointeraction)
## 
## Call:
## lm(formula = daily.sales ~ total.reviews + avg.review, data = publisher_sales)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -104.334  -14.632   -0.742   13.839   93.473 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    5.163958   2.651313   1.948   0.0515 .  
## total.reviews  0.564926   0.007838  72.077   <2e-16 ***
## avg.review    -0.299177   0.565828  -0.529   0.5970    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.22 on 5974 degrees of freedom
## Multiple R-squared:  0.4651, Adjusted R-squared:  0.465 
## F-statistic:  2598 on 2 and 5974 DF,  p-value: < 2.2e-16
#Then I compared the two models to see whether the interaction term improves the fit.
anova(nointeraction,interaction) #The interaction term does not significantly improve the model (F = 1.96, p = 0.162 > 0.05)
## Analysis of Variance Table
## 
## Model 1: daily.sales ~ total.reviews + avg.review
## Model 2: daily.sales ~ total.reviews + avg.review + total.reviews * avg.review
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1   5974 2948425                           
## 2   5973 2947460  1    965.05 1.9557  0.162
#In both models the average review score is not significant, so there is no evidence that review score affects daily sales once the number of reviews is accounted for.
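As a sanity check (a sketch, not part of the original analysis), the 95% confidence intervals of the no-interaction model make the same point: the interval for avg.review straddles zero while the interval for total.reviews does not.

```r
# Sketch: 95% confidence intervals for the coefficients of the
# no-interaction model fitted above.
confint(nointeraction)
```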


#I also tested whether average review score is significant when it is the only predictor. Intuitively, we might expect books with higher review scores to sell more.
onlyscore<-lm(daily.sales~avg.review,data=publisher_sales)
summary(onlyscore) #Even on its own, average review score has essentially no relationship with sales
## 
## Call:
## lm(formula = daily.sales ~ avg.review, data = publisher_sales)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -79.858 -22.338  -4.808  18.985 128.932 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  79.7954     3.3371  23.912   <2e-16 ***
## avg.review   -0.1618     0.7736  -0.209    0.834    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.37 on 5975 degrees of freedom
## Multiple R-squared:  7.317e-06,  Adjusted R-squared:  -0.00016 
## F-statistic: 0.04372 on 1 and 5975 DF,  p-value: 0.8344
#Each one-point increase in average review score is associated with a decrease of only 0.16 books per day, which is not significant (p = 0.834 > 0.05)



#Visualise the data with scatter plots.
publisher_sales$genre<-as.factor(publisher_sales$genre) #Convert genre to a factor so it can be used for grouping and colouring

ggplot(publisher_sales,aes(total.reviews,daily.sales,color=genre))+geom_point(alpha=0.2)+
  labs(title = "Number of Reviews vs. Daily Sales",x="Total number of reviews",y="Average daily sales (books per day)")+geom_smooth(method="lm")

ggplot(publisher_sales,aes(avg.review,daily.sales,color=genre))+geom_point(alpha=0.2)+
  labs(title = "Review Scores vs. Daily Sales",x="Average review score",y="Average daily sales (books per day)")+geom_smooth(method="lm")

What is the effect of sale price upon the number of sales, and is this different across genres?

sales_genre<-lm(daily.sales~sale.price+genre,data=publisher_sales) #Without interaction term
sales_genre_interaction<-lm(daily.sales~sale.price+genre+genre*sale.price,data=publisher_sales) #With interaction term
anova(sales_genre,sales_genre_interaction) #Compare two models
## Analysis of Variance Table
## 
## Model 1: daily.sales ~ sale.price + genre
## Model 2: daily.sales ~ sale.price + genre + genre * sale.price
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1   5973 2938502                                  
## 2   5971 2928492  2     10010 10.205 3.764e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#The ANOVA table shows that the interaction model differs significantly from the model without the interaction (F = 10.2, p < .001). The model with the interaction term therefore explains the data better.
summary(sales_genre_interaction)
## 
## Call:
## lm(formula = daily.sales ~ sale.price + genre + genre * sale.price, 
##     data = publisher_sales)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -102.407  -13.372    0.039   13.088  102.343 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  72.8120     2.5067  29.047  < 2e-16 ***
## sale.price                   -1.7265     0.2459  -7.021 2.45e-12 ***
## genrefiction                 35.2911     3.2793  10.762  < 2e-16 ***
## genrenon_fiction              6.6036     3.2085   2.058  0.03962 *  
## sale.price:genrefiction       1.4531     0.3551   4.092 4.34e-05 ***
## sale.price:genrenon_fiction   1.2767     0.3474   3.675  0.00024 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.15 on 5971 degrees of freedom
## Multiple R-squared:  0.4688, Adjusted R-squared:  0.4683 
## F-statistic:  1054 on 5 and 5971 DF,  p-value: < 2.2e-16
# The main effect of sale price tells us that the number of sales changes with price
# The main effect of genre tells us that sales differ between children's, fiction, and non-fiction books
# The interaction tells us that the effect of sale price on sales differs between genres
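Instead of refitting a separate regression per genre (as done further below), the per-genre price slopes can also be read directly from the interaction model. A sketch using emtrends() from the emmeans package:

```r
# Sketch: the sale.price slope within each genre, with 95% CIs, extracted
# from the fitted interaction model sales_genre_interaction.
emtrends(sales_genre_interaction, ~genre, var = "sale.price")
```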

summary(sales_genre_interaction.emm <- emmeans(sales_genre_interaction,~genre+sale.price))
##  genre       sale.price emmean    SE   df lower.CL upper.CL
##  childrens         8.64   57.9 0.597 5971     56.7     59.1
##  fiction           8.64  105.7 0.521 5971    104.7    106.8
##  non_fiction       8.64   75.5 0.528 5971     74.5     76.6
## 
## Confidence level used: 0.95
genre<-lm(daily.sales~genre,data=publisher_sales)
genre
## 
## Call:
## lm(formula = daily.sales ~ genre, data = publisher_sales)
## 
## Coefficients:
##      (Intercept)      genrefiction  genrenon_fiction  
##            55.56             50.35             20.30
price<-lm(daily.sales~sale.price,data = publisher_sales)
price
## 
## Call:
## lm(formula = daily.sales ~ sale.price, data = publisher_sales)
## 
## Coefficients:
## (Intercept)   sale.price  
##     112.093       -3.818
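To make the overall price model concrete, predicted daily sales at a few illustrative prices can be obtained with predict(). The price values here are hypothetical examples, not taken from the data:

```r
# Sketch: predicted average daily sales at example prices of £5, £10 and £15,
# using the overall price-only model fitted above.
predict(price, newdata = data.frame(sale.price = c(5, 10, 15)))
```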
#Split the data by genre and fit a separate regression per genre to see the price-sales relationship within each one
g_fiction<-publisher_sales%>%filter(genre=="fiction")
g_nonfiction<-publisher_sales%>%filter(genre=="non_fiction")
g_children<-publisher_sales%>%filter(genre=="childrens")
lm(daily.sales~sale.price,data=g_fiction) #For fiction, each £1 increase in price is associated with about 0.27 fewer books sold per day
## 
## Call:
## lm(formula = daily.sales ~ sale.price, data = g_fiction)
## 
## Coefficients:
## (Intercept)   sale.price  
##    108.1030      -0.2734
lm(daily.sales~sale.price,data=g_nonfiction) #For non-fiction, each £1 increase in price is associated with about 0.45 fewer books sold per day
## 
## Call:
## lm(formula = daily.sales ~ sale.price, data = g_nonfiction)
## 
## Coefficients:
## (Intercept)   sale.price  
##     79.4156      -0.4498
lm(daily.sales~sale.price,data=g_children) #For children's books, each £1 increase in price is associated with about 1.73 fewer books sold per day
## 
## Call:
## lm(formula = daily.sales ~ sale.price, data = g_children)
## 
## Coefficients:
## (Intercept)   sale.price  
##      72.812       -1.727
EMM<-ggplot(summary(sales_genre_interaction.emm), mapping=aes(x=genre, y=emmean, ymin=lower.CL, ymax=upper.CL))+
    geom_point(color="red")+
    geom_linerange(aes(ymin=lower.CL, ymax=upper.CL))+ylim(c(55,110))+
    labs(y="Estimated Sales(Number of books)", x="Genres",
         subtitle="Error bars are 95% CIs",
         title="Difference in Sales across Genres")+
  geom_hline(mapping=aes(yintercept=lower.CL), lty=2)+
  geom_hline(mapping=aes(yintercept=upper.CL), lty=2)
EMM

Effect<-ggplot(publisher_sales,aes(sale.price,daily.sales,color=genre))+
  geom_point(alpha=0.2)+
  geom_smooth(method="lm")+
  labs(title="The effect of sale price on the number of sales, by genre",x="Book Price (£)",y="Average Daily Sales (books/day)")
Effect

Separate<-ggplot(publisher_sales,aes(sale.price,daily.sales))+
  geom_point(color="red",alpha=0.2)+
  geom_smooth(method="lm")+
  labs(title="The effect of sale price on the number of sales, by genre",x="Book Price (£)",y="Average Daily Sales (books/day)")+facet_wrap(genre~.)
Separate


Section 2

The following report analyses several months of e-book sales data from multiple sellers, including Amazon Digital Services, Random House LLC, and HarperCollins Publishers. The managers of a publishing company want to understand the relationships among genre, average daily sales, book price (£), and review scores. The report walks the reader through the data and addresses these questions in a clear and accessible way.

Let’s take a brief look at the data first. Table 1 shows that Amazon Digital Services dominates the variety of e-books sold in the market, followed by Random House LLC and Penguin Group (USA) LLC. The bar plot in Figure 1 shows the same data more clearly. Two sellers have only one e-book title each; however, this does not affect the following analysis, since we only want to know the relationships among book sales, book prices, review scores, and genres, so those rows are kept.

Table 1. Number of e-book titles sold by each organisation

Organisation                              Titles
Amazon Digital Services, Inc.             4311
Random House LLC                          442
Penguin Group (USA) LLC                   307
Simon and Schuster Digital Sales Inc      272
HarperCollins Publishers                  261
Macmillan                                 175
Hachette Book Group                       149
HarperCollins Christian Publishing        28
DC Comics                                 24
HarperCollins Publishing                  24
Idea & Design Works                       5
Cengage Learning                          1
RCS MediaGroup S.p.A.                     1

Next, Figure 2 shows that when publishers are grouped by type, no single publisher type dominates the variety of titles and, surprisingly, Amazon is in last place. Small/medium publishers offer the widest range of e-books (more than 2,000 titles).

After this, we dive deeper into the data we care about most: book sales, prices (£), review scores, and genres.

The first question concerns genre. There are many book genres and people have their favourites, but does any genre simply sell better, or worse, than the others? In other words, do books from different genres have different daily sales on average?

The following figures show average sales by genre, and by genre and publisher. Fiction is clearly the best-selling genre, with 105.89 books sold per day, followed by non-fiction with 75.87 books per day; children's books come last with only 55.58 books per day. A statistical test (F = 2585, p < .05) confirms that fiction, non-fiction, and children's books have significantly different sales, with fiction e-books the highest. The "Difference in Sales across Genres" plot shows the estimated sales for each genre; none of the confidence intervals overlap, so we can be confident the genres really do differ. These differences are plausible: with declining birth rates worldwide, children's e-books sell the least, while rapid technological change may be feeding readers' appetite for fiction, making fiction e-books the best sellers.

Drilling down within genres, all publisher types achieve roughly the same sales within each genre. This suggests that, for a given genre, which publisher releases a book makes little difference to its sales.

Secondly, when people look for the nearest restaurant for dinner or the most popular site to visit, they usually browse the comments and review scores before deciding where to go. Does the same behaviour apply to buying e-books? In other words, do e-books have more or fewer sales depending upon their average review scores and total number of reviews?

The graph below shows the relationship clearly. The number of reviews and daily sales are positively correlated, and this positive relationship holds in every genre: the more reviews a book has, the more copies it sells. However, this does not mean that a book with many reviews can be expected to become a best seller; correlation does not imply causation.

The next graph shows the relationship between review scores and daily sales across all genres. There is no significant correlation (F = 0.044, p = 0.834 > 0.05) between review scores and sales, which is counter-intuitive. One possible reason is that the data points are heavily concentrated between scores of 4-5 and sales of 50-150 books, so any correlation is very weak.

Finally, when people buy things, price is usually their first consideration. Does this principle also apply to e-books? We therefore ask: what is the effect of sale price on the number of sales, and does it differ across genres? The plot below shows the relationship between sale price and the number of sales across all genres. Clearly, as the sale price increases, sales decrease: for every £1 increase in sale price, the overall number of sales falls by about 3.82 books per day. There is thus a negative correlation between sale price and the number of sales.

However, once genre is taken into account, the picture changes. For fiction and non-fiction books the relationship between sale price and sales is much weaker: a £1 increase in price is associated with only about 0.27 fewer fiction books and 0.45 fewer non-fiction books sold per day. Children's books, by contrast, show a clear negative relationship: a £1 increase in price is associated with about 1.73 fewer books sold per day. A possible explanation is that parents buy e-books for children who may quickly lose interest and never reread them, so parents are unwilling to pay much for them.

Conclusion: the report shows that e-book sales differ across genres (fiction, non-fiction, children's). Fiction e-books sell best and children's e-books sell worst. The number of reviews is positively correlated with the number of books sold, but the review score shows no correlation with sales. Finally, book price is negatively correlated with sales overall; interestingly, however, this negative relationship is weak for fiction and non-fiction e-books and clear only for children's e-books.